Take-home Exercise 3 - MC3

Author

Shaun Tan

Published

June 3, 2023

Modified

June 18, 2023

1. The Task

1.1 Background

The country of Oceanus has sought FishEye International’s help in identifying companies possibly engaged in illegal, unreported, and unregulated (IUU) fishing. As part of the collaboration, FishEye’s analysts received import/export data for Oceanus’ marine and fishing industries. However, Oceanus has informed FishEye that the data is incomplete. To facilitate their analysis, FishEye transformed the trade data into a knowledge graph. Using this knowledge graph, they hope to understand business relationships, including finding links that will help them stop IUU fishing and protect marine species that are affected by it. FishEye analysts found that node-link diagrams gave them a good high-level overview of the knowledge graph. However, they are now looking for visualizations that provide more detail about patterns for entities in the knowledge graph.

1.2 The Task in detail

Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.

2. Data Prep

2.1 Loading the requisite R libraries and importing the JSON file:

pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, tidytext, tidyverse, skimr, patchwork, ggdist, ggridges, ggthemes, scales)
mc3_data <-
  jsonlite::fromJSON("data/MC3.json")
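
As a minimal sketch of what `fromJSON()` returns for a node-link file, assuming top-level `nodes` and `links` arrays: the toy JSON below is made up for illustration and is not the MC3 file.

```r
library(jsonlite)

# Hypothetical miniature of a node-link JSON file (not the real MC3 data)
toy_json <- '{
  "nodes": [
    {"id": "Company A", "type": "Company", "revenue_omu": 1000},
    {"id": "Jane Doe", "type": "Beneficial Owner"}
  ],
  "links": [
    {"source": "Company A", "target": "Jane Doe", "type": "Beneficial Owner"}
  ]
}'

toy_data <- fromJSON(toy_json)

# Arrays of records are simplified into data frames;
# a field that is absent from a record becomes NA
str(toy_data$nodes)
str(toy_data$links)
```

Fields that are absent for some records, such as revenue here, surface as `NA`, which is worth keeping in mind when reading the missing-value summaries later on.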

2.2 Extracting Edges

mc3_edges <-
  as_tibble(mc3_data$links) %>%
  distinct() %>%                        # drop exact duplicate links
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%          # count links per source-target-type triple
  filter(source != target) %>%          # drop self-loops
  ungroup()
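
To illustrate the effect of this pipeline, here is a minimal sketch on a made-up edge list (the `toy_links` table is hypothetical). It uses `count()` as a shorthand for the `group_by()`, `summarise(n())`, `ungroup()` sequence.

```r
library(dplyr)

# Hypothetical toy edge list with one exact duplicate and one self-loop
toy_links <- tibble(
  source = c("A", "A", "B", "C"),
  target = c("x", "x", "y", "C"),
  type   = c("Beneficial Owner", "Beneficial Owner",
             "Company Contacts", "Company Contacts")
)

toy_edges <- toy_links %>%
  distinct() %>%                                  # drop exact duplicate rows
  count(source, target, type, name = "weights") %>%
  filter(source != target)                        # drop self-loops

toy_edges
```

Because `distinct()` runs before the counting step, every remaining source-target-type triple appears exactly once, so `weights` is always 1 in this sketch.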

2.3 Extracting Nodes

mc3_nodes <- as_tibble(mc3_data$nodes) %>%
# distinct() %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)

3. Initial Data Exploration

3.1 Exploring Edges

3.1.1 Edges from the mc3_edges dataframe are explored using the skim and datatable functions.

skim(mc3_edges)
Data summary
Name                 mc3_edges
Number of rows       24036
Number of columns    4
_______________________
Column type frequency:
  character          3
  numeric            1
________________________
Group variables      None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
source                 0              1    6  700      0     12856           0
target                 0              1    6   28      0     21265           0
type                   0              1   16   16      0         2           0

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd  p0  p25  p50  p75  p100  hist
weights                0              1     1   0   1    1    1    1     1  ▁▁▇▁▁

There are no missing values in the mc3_edges dataframe.

We next take a glimpse of the first few lines of the dataframe using the following datatable.

DT::datatable(mc3_edges)

This dataframe depicts the connections between companies and company contacts/beneficial owners.

3.1.2 Bar Chart of Type of Edges

To further explore the count of each type of contact, we plot a bar chart of the type of edges in the mc3_edges dataframe.

ggplot(data = mc3_edges,
       aes(x = type)) +
  geom_bar() +
  geom_text(
    aes(label = after_stat(count)),
    stat = "count",
    vjust = -2
  ) + 
  ylim (0,18000) + 
  labs(title = "Count of type of edges in mc3_edges dataframe", y= "Count", x = "Type of Edges" )

There are significantly more beneficial owner edges (16792) than company contact edges (7244).

3.2 Exploring the nodes data frame

3.2.1 Nodes from the mc3_nodes dataframe are explored using the skim and datatable functions.

skim(mc3_nodes)
Data summary
Name                 mc3_nodes
Number of rows       27622
Number of columns    5
_______________________
Column type frequency:
  character          4
  numeric            1
________________________
Group variables      None

Variable type: character

skim_variable     n_missing  complete_rate  min   max  empty  n_unique  whitespace
id                        0              1    6    64      0     22929           0
country                   0              1    2    15      0       100           0
type                      0              1    7    16      0         3           0
product_services          0              1    4  1737      0      3244           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean        sd       p0      p25       p50       p75       p100  hist
revenue_omu        21515           0.22  1822155  18184433  3652.23  7676.36  16210.68  48327.66  310612303  ▇▁▁▁▁

There are 21515 missing values in the revenue_omu column. Given the importance of this data and the large number of missing values, this arouses suspicion and is explored in greater detail subsequently.
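
One way to see where the missing revenue sits is to break it down by node type. The following is a minimal sketch on a made-up table (`toy_nodes` is hypothetical); the same summary could be applied to mc3_nodes.

```r
library(dplyr)

# Hypothetical toy nodes table (not the MC3 data)
toy_nodes <- tibble(
  id = c("A", "B", "C", "p1", "p2"),
  type = c("Company", "Company", "Company",
           "Beneficial Owner", "Company Contacts"),
  revenue_omu = c(1000, NA, 25000, NA, NA)
)

# Count and share of missing revenue values per node type
missing_by_type <- toy_nodes %>%
  group_by(type) %>%
  summarise(
    n_nodes = n(),
    n_missing = sum(is.na(revenue_omu)),
    share_missing = mean(is.na(revenue_omu)),
    .groups = "drop"
  )

missing_by_type
```

If most of the missing values turn out to belong to individuals rather than companies, the gap is less alarming than the overall count suggests, so this breakdown is a useful first check.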

DT::datatable(mc3_nodes)

3.2.2 Bar Chart of Type of Nodes

Next, a bar chart of the type of nodes is plotted.

ggplot(data = mc3_nodes,
       aes(x = type)) +
  geom_bar() +
  geom_text(
    aes(label = after_stat(count)),
    stat = "count",
    vjust = -2
  ) + ylim (0,13000) + labs(
    title = "Count of type of nodes in mc3_nodes dataframe", y= "Count", x = "Type" )

3.2.3 Bar chart of Country of Origin

Next, we explore the Country of Origin of the nodes for interesting relationships, comparing Companies, Company contacts, and Beneficial owners.

nodes_country <- mc3_nodes %>%
  group_by(country, type) %>%
  summarise(count = n()) %>%
  ungroup()
plot_company <- nodes_country %>%
  filter(type == "Company" &
           count > 150) %>%
  ggplot(aes(x = reorder(country, -count), y = count)) +
  geom_col() +
  ylim(0,4000) +
  geom_text(
    aes(label = count),
    vjust = -2
  ) +  
  labs(
    title = "Count of Company's Country of Origin", y= "Count", x = "Country", subtitle = "Companies predominantly from ZH, Oceanus, and Marebak"
  )

plot_owner <- nodes_country %>%
  filter(type == "Beneficial Owner") %>%
  ggplot(aes(x = reorder(country, -count), y = count)) +
  geom_col() +
  ylim(0,14000) +
  geom_text(
    aes(label = count),
    vjust = -2
  ) +  
  labs(
    title = "Count of Beneficial Owner's Country of Origin", y= "Count", x = "Country", subtitle = "Beneficial Owners predominantly from ZH"
  )

plot_contacts <- nodes_country %>%
  filter(type == "Company Contacts") %>%
  ggplot(aes(x = reorder(country, -count), y = count)) +
  geom_col() +
  ylim(0,8000) +
  geom_text(
    aes(label = count),
    vjust = -2
  ) +  
  labs(
    title = "Count of Company Contacts' Country of Origin", y= "Count", x = "Country", subtitle = "Company Contacts predominantly from ZH"
  ) 

plot_company / plot_owner / plot_contacts

Note

Despite most of the owners and company contacts originating from ZH, the companies' countries of origin are more diverse, with ZH, Oceanus, and Marebak taking the top 3 spots. This is indicative of owners venturing out of their own countries to set up companies elsewhere.

3.2.4 Distribution of Revenue of Companies

Revenue is postulated to provide substantial information on the company, and would be instrumental in providing clues to any anomalous behaviours. A boxplot with frequency dot plot overlaid is plotted to visualise the distribution.

company_revenue <- mc3_nodes %>% 
  filter(type == "Company")

ggplot(company_revenue, 
       aes(y = revenue_omu)) +
 scale_y_continuous(
    limits = c(0, 200000),
    breaks = pretty_breaks(n = 5),
    labels = dollar_format())+
  geom_boxplot(width = 0.5,
               outlier.shape = NA, color = 'darkred') +
  stat_dots(color = 'blue') +
  coord_flip() + 
  labs(
    title = "Distribution of Revenue of Companies", y= "Revenue", x = "Count", subtitle = "Highly right skewed distribution of companies' revenue"
  )

The chart above shows that a majority of the companies make less than $25,000 in revenue and that the distribution of company revenue is right-skewed.
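
The visual impression of skew can be backed with a few summary numbers. The sketch below uses made-up revenue values (`toy_revenue` is hypothetical) to compute the share of companies below $25,000 and to compare the median with the mean.

```r
library(dplyr)

# Hypothetical revenue values, including one very large company and one NA
toy_revenue <- tibble(
  revenue_omu = c(4000, 8000, 12000, 16000, 20000, 60000, 3000000, NA)
)

revenue_summary <- toy_revenue %>%
  filter(!is.na(revenue_omu)) %>%
  summarise(
    share_below_25k = mean(revenue_omu < 25000),
    median_revenue = median(revenue_omu),
    mean_revenue = mean(revenue_omu)
  )

revenue_summary
```

A mean far above the median, as in the skim output for revenue_omu earlier, is the usual signature of a right-skewed distribution driven by a few very large values.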

3.2.5 Relationship between Companies and Number of Owners

Getting the number of owners each company has:

edges_by_target <- mc3_edges %>%
  filter(type == 'Beneficial Owner') %>%
  group_by(source, type) %>%
  summarise(owner_count = n())%>%
  arrange(desc(owner_count))%>%
  ungroup()

Exploring the datatable of edges by companies:

DT::datatable(edges_by_target)

The table above displays the number of owners each company has. It is postulated that a company with multiple beneficial owners is subject to the oversight of many people and is therefore less likely to be engaged in dubious activities, whereas a company with a single beneficial owner acts at the bidding of that owner alone.

owner_count_df <- edges_by_target %>%
  group_by(owner_count) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()
filtered_data <- owner_count_df %>%
  filter(owner_count <= quantile(owner_count, 0.2))

ggplot(filtered_data, aes(x = factor(owner_count), y = count)) +
  geom_col() +
  ylim(0,7000) +
  geom_text(
    aes(label = count),
    vjust = -1,
    size = 3
  ) +  
  labs(
    title = "Count of Companies by number of owners", y= "Count of Companies", x = "Number of Owners", subtitle = "Majority of companies have only one owner"
  )

The above graph shows that 6415 companies are sole proprietorships and form the majority of the companies.

3.2.6 Conclusion of EDA

Note

Initial Sensing:

  1. Beneficial owners who are sole owners of multiple companies wield disproportionate influence, as they have more autonomy and are not subject to the scrutiny of other beneficial owners.

  2. Companies with unreported revenue yet many company contacts: the number of contacts suggests an extensive company that could be earning high revenue, yet that revenue is unaccounted for.

  3. Given that the dataset contains companies engaged in many other types of products/services besides fish-related services, we should narrow down the search to just fish-related companies to further explore anomalous behaviours.

3. Network Visualisation and Analysis

3.1 Point of Interest 1

The first anomalous behaviour to be investigated is sole beneficial owners of multiple companies, with companies having higher revenue being more suspicious. The reasoning is that these companies have no need to be transparent or accountable to other shareholders. As such, they are less deterred from pursuing illegal activities given the reduced oversight.
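
To make the criterion concrete, here is a minimal sketch on a made-up ownership table (`toy_own` is hypothetical, with source as the company and target as the owner). It first keeps companies with exactly one beneficial owner, then counts how many such companies each owner holds; the threshold of 2 is chosen only to suit the toy data.

```r
library(dplyr)

# Hypothetical toy ownership edges: source = company, target = owner
toy_own <- tibble(
  source = c("C1", "C2", "C3", "C4", "C4"),
  target = c("Ann", "Ann", "Ann", "Bob", "Cat")
)

# Companies with exactly one beneficial owner
sole_owner_edges <- toy_own %>%
  group_by(source) %>%
  filter(n() == 1) %>%
  ungroup()

# Owners who are the sole owner of at least 2 companies
sole_owner_counts <- sole_owner_edges %>%
  count(target, name = "n_companies") %>%
  filter(n_companies >= 2)

sole_owner_counts
```

Note that `distinct(source, .keep_all = TRUE)` is a different operation: it keeps the first row for every company, whether the company has one owner or several.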

Wrangling the edges and nodes dataframe to enable the creation of a network graph.

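# Keep beneficial-owner edges of companies that have exactly one beneficial owner
single_bowners <- mc3_edges %>%
  filter(type == "Beneficial Owner") %>%
  group_by(source) %>%
  filter(n() == 1) %>%
  ungroup()

# Keep sole owners that are linked to at least 4 such companies
single_bowner_count <- single_bowners %>%
  group_by(target) %>%
  mutate(count = n()) %>%
  filter(count >= 4) %>%
  ungroup()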
single_bowner_count_revenue <- left_join(single_bowner_count, mc3_nodes, by = c("source"="id")) %>%
  select(-type.y) %>%
  rename("type" = "type.x")

single_bowner_count_revenue1 <-  single_bowner_count_revenue %>%
  distinct() %>%
  rename("from" = "source",
         "to" = "target")

bowner_source <- single_bowner_count_revenue1 %>%
  distinct(from) %>%
  rename("id" = "from")

bowner_target <- single_bowner_count_revenue1 %>%
  distinct(to) %>%
  rename("id" = "to")

# visNetwork requires unique node ids, so drop any id that appears twice
bowner_nodes_extracted <- rbind(bowner_source, bowner_target) %>%
  distinct(id)

bowner_nodes_extracted$group <- ifelse(bowner_nodes_extracted$id %in% single_bowner_count_revenue$source, "Company", "Beneficial Owner")

Creating the visNetwork Graph

visNetwork(
    bowner_nodes_extracted, 
    single_bowner_count_revenue1
  ) %>%
  visIgraphLayout(
    layout = "layout_with_fr"
  ) %>%
  visGroups(groupname = "Company",
            color = "lightblue") %>%
  visGroups(groupname = "Beneficial Owner",
            color = "yellow") %>%
  visLegend() %>%
  visEdges(
    arrows = "to"
  ) %>%
  visOptions(
    highlightNearest = list(enabled = T, degree = 2, hover = T),
    nodesIdSelection = TRUE,
    selectedBy = "group",
    collapse = TRUE)
Note

With the possibility of reduced oversight, there is a chance that these companies may be participating in suspicious activity. However, it should be noted that neither the size of these companies nor the details of their nature of business are available on the network graph. These should therefore be investigated in greater detail before a firm conclusion can be drawn. In the meantime, since a sole owner holding multiple companies is the exception and not the norm, these owners should be monitored.
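
One way to bring company size onto the graph is through the `value` and `title` node columns, which visNetwork uses for node scaling and hover tooltips respectively. The following is a minimal sketch on made-up data; the toy tables and the choice of a log10 scale are assumptions for illustration, not part of the analysis above.

```r
library(dplyr)
library(visNetwork)

# Hypothetical toy nodes: two companies and their shared sole owner
toy_nodes <- tibble(
  id = c("C1", "C2", "Ann"),
  group = c("Company", "Company", "Beneficial Owner"),
  revenue_omu = c(5000, 250000, NA)
) %>%
  mutate(
    # value drives node size; nodes without revenue get a small default
    value = if_else(is.na(revenue_omu), 1, log10(revenue_omu)),
    # title is shown as a tooltip on hover
    title = if_else(
      is.na(revenue_omu),
      paste0(id, ": revenue not reported"),
      paste0(id, ": ", format(revenue_omu, big.mark = ",", trim = TRUE))
    )
  ) %>%
  select(id, group, value, title)

toy_edges <- tibble(from = c("C1", "C2"), to = c("Ann", "Ann"))

net <- visNetwork(toy_nodes, toy_edges) %>%
  visEdges(arrows = "to") %>%
  visLegend()

net
```

A log scale is used here because revenue is heavily right-skewed; on a linear scale a handful of very large companies would dwarf every other node.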

3.2 Point of Interest 2

The next suspicious behaviour that deserves investigation is companies with many company contacts but no reported revenue. There is a chance that these companies are in fact bringing in substantial revenue but have not declared it.

Wrangling the edges and nodes dataframe to enable the creation of a network graph.

# Extract nodes that have unreported revenue
nodes_norev <- mc3_nodes %>%
  filter(is.na(revenue_omu))

# Keep the companies among them (edge sources are companies)
nodes_norev_company <- nodes_norev %>%
  filter(type == "Company") %>%
  distinct()

# Extract company-contact edges whose source company has unreported revenue
edges_norev <- mc3_edges %>%
  filter(type == "Company Contacts") %>%
  filter(source %in% nodes_norev_company$id) %>%
  distinct() %>%
  rename("from" = "source",
         "to" = "target")

# Keep companies that have 3 or more company contacts
edges_norev_high <- edges_norev %>%
  group_by(from) %>%
  mutate(count = n()) %>%
  filter(count >= 3) %>%
  ungroup()

# Get distinct Source and Target
norev_source <- edges_norev_high %>%
  distinct(from) %>%
  rename("id" = "from")

norev_target <- edges_norev_high %>%
  distinct(to) %>%
  rename("id" = "to")

# Bind into single dataframe with unique ids
nodes_norev1 <- bind_rows(norev_source, norev_target) %>%
  distinct(id)

# Sources are companies; targets are their company contacts
nodes_norev1$group <- ifelse(nodes_norev1$id %in% edges_norev_high$from, "Company", "Company Contact")

Creating the visNetwork graph

visNetwork(
    nodes_norev1, 
    edges_norev_high
  ) %>%
  visIgraphLayout(
    layout = "layout_with_fr"
  ) %>%
  visGroups(groupname = "Company",
            color = "lightblue") %>%
  visGroups(groupname = "Company Contact",
            color = "yellow") %>%
  visLegend() %>%
  visEdges(
    arrows = "to"
  ) %>%
  visOptions(
    highlightNearest = list(enabled = T, degree = 2, hover = T),
    nodesIdSelection = TRUE,
    selectedBy = "group",
    collapse = TRUE)
Note

It is indeed suspicious that so many large companies have unreported revenue. Given the lack of revenue data, the postulated size of a company can only be extrapolated from its number of contacts and number of beneficial owners. It should be noted that the number of contacts of these top few companies is similar to that of the top few companies with reported revenue. This should be investigated in further detail, with other forms of proxy information pieced together to determine what these companies could be hiding.
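
The size proxy described above can be tabulated directly from the edge list. This is a minimal sketch on made-up edges (`toy_edges` is hypothetical) that counts contacts and owners per company and combines them into a single link count.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy edges: source = company, target = person
toy_edges <- tibble(
  source = c("C1", "C1", "C1", "C2", "C2"),
  target = c("p1", "p2", "p3", "p4", "p5"),
  type = c("Company Contacts", "Company Contacts", "Beneficial Owner",
           "Beneficial Owner", "Beneficial Owner")
)

# One row per company, with owner and contact counts side by side
company_size_proxy <- toy_edges %>%
  count(source, type) %>%
  pivot_wider(names_from = type, values_from = n, values_fill = 0L) %>%
  rename(n_owners = `Beneficial Owner`,
         n_contacts = `Company Contacts`) %>%
  mutate(n_links = n_owners + n_contacts) %>%
  arrange(desc(n_links))

company_size_proxy
```

Joining such a table to the nodes would allow companies with and without reported revenue to be compared on the same proxy.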

3.3 Point of Interest 3

The last visual explores fish-related anomalous behaviour specifically, using the networks of the largest company-beneficial owner relationships among fish-related businesses. The aim is to understand the typical network size of a fish-related business in terms of number of beneficial owners, and to compare it with the industry standard.

Wrangling the edges and nodes dataframe to enable the creation of a network graph.

# Extract nodes that are fish-related
fish_nodes <- mc3_nodes %>%
  filter(grepl("fish", product_services, ignore.case = TRUE))

fish_nodes_bowners <- fish_nodes %>%
  filter(type == "Beneficial Owner") %>%
  distinct()

fish_nodes_companies <-fish_nodes %>%
  filter(type == "Company") %>%
  distinct()

# Extract edges that are fish related
edges_fish <- mc3_edges %>%
  filter(type == "Beneficial Owner") %>%
  filter(source %in% fish_nodes$id) %>%
  distinct() %>%
  rename("from" = "source",
         "to" = "target")

# Extract edges that have more than or equal to 8 links
edges_fish_high <- edges_fish %>%
  group_by(from) %>%
  mutate(count = n()) %>%
  filter(count >= 8) %>%
  ungroup()
# Get distinct Source and Target
fish_source <- edges_fish_high %>%
  distinct(from) %>%
  rename("id" = "from")

fish_target <- edges_fish_high %>%
  distinct(to) %>%
  rename("id" = "to")
# Bind into single dataframe
nodes_fish1 <- bind_rows(fish_source, fish_target) %>%
  distinct(id)

nodes_fish1$group <- ifelse(nodes_fish1$id %in% fish_nodes_companies$id, "Company", "Beneficial Owner")

Creating the visNetwork Graph

visNetwork(
    nodes_fish1, 
    edges_fish_high
  ) %>%
  visPhysics(solver = "forceAtlas2Based",
               forceAtlas2Based = list(gravitationalConstant = -100)) %>%
  visIgraphLayout(
    layout = "layout_with_fr"
  ) %>%
  visGroups(groupname = "Company",
            color = "yellow") %>%
  visGroups(groupname = "Beneficial Owner",
            color = "lightblue") %>%
  visLegend() %>%
  visEdges(
    arrows = "to"
  ) %>%
  visOptions(
    highlightNearest = list(enabled = T, degree = 2, hover = T),
    nodesIdSelection = TRUE,
    selectedBy = "group",
    collapse = TRUE)
Note

It is interesting to note that no individual is a beneficial owner of more than one of these “large” or extensive companies. It is possible that these companies avoid sharing owners with one another for fear of conflicts of interest or corporate espionage. While this is not necessarily anomalous behaviour, it is an interesting point to note, and it may even be representative of the different cartels in the fish industry.
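
The absence of shared owners can also be checked in a table rather than read off the graph. The sketch below uses made-up edges (`toy_fish_edges` is hypothetical, with from as the company and to as the owner) and lists every owner linked to more than one company.

```r
library(dplyr)

# Hypothetical toy edges: from = company, to = beneficial owner
toy_fish_edges <- tibble(
  from = c("F1", "F1", "F2", "F2", "F3"),
  to   = c("Ann", "Bob", "Cat", "Dan", "Ann")
)

# Owners that appear under more than one distinct company
shared_owners <- toy_fish_edges %>%
  group_by(to) %>%
  summarise(n_companies = n_distinct(from), .groups = "drop") %>%
  filter(n_companies > 1)

shared_owners
```

An empty result on the real edges would support the observation above; in this toy data one owner is deliberately shared so that the output is non-empty.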

4. Conclusion

Exploring the networks between the various types of nodes, or players, in this space has been useful for visualising the relationships between the different parties. It has yielded interesting insights into how certain companies may wield more autonomy, or how certain companies, despite being expansive, may be hiding behind unreported revenues.

For future work, the product_services column can be analysed using text analytics methods to provide an additional layer of insight to the overall visualisation of networks and information in this project.
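
As a possible starting point for that future work, here is a minimal sketch using tidytext, which is already loaded in this project. The `toy_nodes` table and its descriptions are made up; the approach is simple word counting after removing common stop words.

```r
library(dplyr)
library(tidytext)

# Hypothetical product descriptions (not the MC3 data)
toy_nodes <- tibble(
  id = c("C1", "C2", "C3"),
  product_services = c(
    "Fresh and frozen fish products",
    "Seafood processing and canned fish",
    "Shoes and leather goods"
  )
)

word_counts <- toy_nodes %>%
  unnest_tokens(word, product_services) %>%   # one lowercase word per row
  anti_join(stop_words, by = "word") %>%      # drop common stop words
  count(word, sort = TRUE)

word_counts
```

The most frequent remaining words could then be used to assign each company to a product category, giving the network graphs an additional grouping variable.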